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Abstract 

This paper considers the robust and efficient implementation of Gaussian process regression 
with a Student-f observation model. The challenge with the Student-t model is the analyti- 
cally intractable inference which is why several approximative methods have been proposed. 
The expectation propagation (EP) has been found to be a very accurate method in many 
empirical studies but the convergence of the EP is known to be problematic with models 
containing non-log-concave site functions such as the Student-f distribution. In this paper 
we illustrate the situations where the standard EP fails to converge and review different 
modifications and alternative algorithms for improving the convergence. We demonstrate 
that convergence problems may occur during the type-II maximum a posteriori (MAP) 
estimation of the hyperparameters and show that the standard EP may not converge in 
the MAP values in some difficult cases. We present a robust implementation which relies 
primarily on parallel EP updates and utilizes a moment-matching-based double-loop al- 
gorithm with adaptively selected step size in difficult cases. The predictive performance 
of the EP is compared to the Laplace, variational Bayes, and Markov chain Monte Carlo 
approximations. 

Keywords: Gaussian process, robust regression, Student-t likelihood, approximate infer- 
ence, expectation propagation 



1. Introduction 

In many regression problems observations may include outliers which deviate strongly from 
the other members of the sample. Such outliers may occur, for example, because of failures in 
the measurement process or absence of certain relevant explanatory variables in the model. 
In such robust observation model is required. 
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Robust inference has been studied extensively. iDe Finettil (jl96ll ) described how Bayesian 
inference on the mean of a random sample, assuming a suitable observation model, naturally 
leads to giving less weight to outlying observations. However, in contrast to simple rejection 
of outliers, the posterior depends on all data but in the limit, as the separation between the 
outliers and the rest of the data increases, the effect of outliers becom es negligible. More 
theoretical results on this kind of outlier rejection were presented by iDawidI (jl973l ) who 
gave sufficient conditions on the observation model p{y\9) and the prior distribution p(6) of 
an unknown location parameter 9, which ensure that the posterior expectation of a given 
function m{9) tends to the prior as y — )• oo. He also stated that the Student-t distribution 
combined with a normal prior has this property. 

A more formal definition of robustness was given by lO'Hagan (|l979l ) in terms of an 
outlier-prone observation model. The observation model is defined to be outlier-prone of 
order n, if p{9\yi, ...,yn+i) — >■ p{9\yi, ...,yn) as yn+i oo. That is, the effect of a single 
conflicting observati on to the poster ior becomes asymptotically negligible as the observation 
approaches infinity. lO'Haga 3 (|l979h showed that the Student-t distribution is outlier prone 
of order 1, and that it can reject up to m outliers if there are at least 2m observations 
altogether. This contrasts heavily with the commonly used Gaussian observation model in 
which each observation infiuences the posterior no matter how far it is from the others. 
In the nonlinear Gaussian process (GP) regression context the outlier rejection is more 
complicated and one may consider the posterior distribution of the unknown function values 
fi = f{xi) locally near some input locations Xj. Depending on the smoothness properties 
defined through the prior on fi, m observations can be rejected locally if there are at least 2m 
data points nearby. However, already two confiicting data points can render the posterior 
distribution multimodal making the posterior inference challenging (these issues will be 
illustrated in the upcoming sections). 

In this work, we adopt the Student-t observation model for GP regression because of 
its good robustness properties which can be altered continuously from a very heavy tailed 
distribution to the Gaussian model with the degrees of freedom parameter. This allows the 
extent of robustness to be determined from the data through hype rparameter i nfere nce. The 
Student-t o bservation m odel was studied in linear regression by West ( 19841 ) and Geweke 
(Il993h . and iNeall (Il997h introduced it for GP regression. Other robust observation models 
which have been utilized in th e GP regression include, fo r exarnple, mixtures of Gaussians 
(|KussI . l2006l ; IStee:le et al.l. 120081). the Lapla. c e distribution (IKussl. 120061) . and in put dependent 
observation models ( Goldberg et al. . 19981 : Naish-Guzman and Holden . 20081 ). 

The challenge with the Student-t model is the analytically intractable inference. A com- 
mon approach has been to use the scale-mixtur e representation of the Stu dent-t distribution 
( Geweke . 19931 ). which enables Gibbs sampling (Geweke, 1993 : Neal 1997), and a fact o rizing 
variational approximati on (fVB) for the poster ior inference ([Tipping and Lawrencel . l2005l : 



KtissI . boOfil ). Recentl v IVanhatalo et aD (l2009l) comp ared the fVB with the Laplace ap- 



proximation (see, e.... iRasmussen and A^^^ ffl)) and showed that Laplace's methc;d 
provided slightly better predictive performance with less computational burden. They also 
showed that the fVB tends to underestimate the posterior uncertainties of the predictions 
because it assumes the scales and the unknown function values a posteriori independent. 
Another variation al approach called varia t ional bounds (VB) is available in the GPML 
software package ( Rasmussen and Nickisch . 2010l ). The method is based on forming an un- 
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normalized Gaussian low e r bou nd for each non-Gaussian likelihood term independently (see 
Nickisch and RasmussenI (j2008l ) for details and c omparisons in GP classif i cation ). Yet an- 



other related variational approach is described by Opper and Archambeau ( 20091 ) who stud- 
ied the Cauchy observation model (Student-i with degrees of freedo m 1). This method is sim- 
i lar to the KL-divergence minimization approach (KL) described by lNickisch and Rasmussen 
( 20081 ) a nd the VB approach can be rega rded as a special case of KL. The extensive compar- 
isons by iNickisch and RasmussenI (120081 ) in GP classification suggest that the VB provides 
better predictive performance than the Laplace approximation but worse mar ginal likelihood 
estimates than the KL or the expectation prqpaga- tion (EP) ( Minka . 2001bl ). According to 
the comparisons of INickisch and RasmussenI ()2008l ), EP is the method of choice since it is 
much faster than KL, at least in GP classification. The problem with the EP, however, is 
that th e Student-t likelihood is not log-concave which may lead to convergence problems 
(jSeegerl - lioOsh . 

In this paper, we focus on establishing a robust EP implementation for the Student-f 
observation model. We illustrate the convergence problems of the standard EP with simple 
one-dimensi qnal regressio n examp l es an d discuss how damping, fractio nal EP updates (or 
powe r EP) (jMinkal . l2004l : ISeegeii . |2005| ). and double-loop algorithms (jHeskes and Zoeteii . 



2002) can be used to improve the convergence. We present a robust i mplem entation which 



relies primarily on parallel EP updates (see e.g., Ivan Gerven et aP . l2009l ) and utilizes a 
moment-matching-based double-loop algorithm with adaptively selected step size to find 
stationary solutions in difficult cases. We show that the implementation enables a robust 
type-II maximum a posteriori (MAP) estimation of the hyperparameters based on the ap- 
proximative marginal likelihood. The proposed implementation is compared to the Laplace 
approximation, fVB, VB, and Markov chain Monte Carlo (MCMC) using one simulated and 
three real- world data sets. 



2. Gaussian Process Regression with Student-t Observation Model 

We will consider a regression problem, with scalar observations yi = /(xj) = 1, ...,n 

at input locations X = {xj}"^]^, and where the observation errors ei,...,e„ are zero-mean 
exchangeable random variables. The object of inference is the latent function /(x) : JR'^ — )• 3ft, 
which is given a Gaussian process prior 

/(x)|0~gP(m(x),A;(x,x'|^)), (1) 

where m(x) and A;(x, x' \9) are the mean and covariance functions of the process controlled by 
hyperparameters 6. For notational simplicity we will assume a zero mean GP. By definition, 
a Gaussian process prior implies that any finite subset of latent variables, f = {/(xj)}^]^, has 
a multivariate Gaussian distribution. In particular, at the observed input locations X the 
latent variables are distributed as p(f |X, 9) = A/'(f |0, Kf f), where Kf f is a covariance matrix 
with entries [Kf^f]jj = k(xi,Xj \9). The covariance function encodes the prior assumptions 
on the latent function, such as the smoothness and scale of the variation, and can be chosen 
freely as long as the covariance matrices which it produces are symmetric and positive semi- 
definite. An example of a stationary covariance function is the squared exponential 

fcsc(xi, Xj \9) = o-g^e exp - ^ ^,2 ^' — ' (2) 
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where 9 = {(Tge, /i, Zf^}, cjgg is a magnitude parameter which scales the overah variation 
of the unknown function, and Ik is a length-scale parameter which governs how fast the 
correlation decreases as the distance increases in the input dimension k. 

The traditional assumption is that given f the error terms are i.i.d. Gaussian: ei ~ 
7V(0,cj2). In this case, the marginal likelihood p(y|X,^, cr^) and the conditional posterior 
of the latent variables p(f |P, 0, cj^), where P = {y,X}, have an analytical solution. This 
is computationally convenient since approximate methods are needed only for the inference 
on the hyper parameters 9 and cr^. However, the limitation with the Gaussian model is its 
non-robustness. The robust Student-f observation model 

, , , r((z. + i)/2) , im-hfy^''^''^" 



where v is the degrees of freedom and a the scale parameter ( Gelman et al. . 20041 ) . is compu- 



tationally challenging. The marginal likelihood and the conditional posterior p{i\'D^9^a'^) 
are not anymore analytically tractable but require some method for approximate inference. 

3. Approximate Inference for the Student-t Model 

In this section, we review the approximate inference methods considered in this paper. First 
we give a short description of the MCMC and the Laplace approximation, as well as the 
two variational methods, fVB and VB. Then we give a more detailed description of the EP 
algorithm and review ways to improve the convergence in more difficult problems. 

3.1 Markov Chain Monte Carlo 

The MCMC approach is based on drawing samples from p(f , 0, o"^, and using these 
samples to represent the posterior distribution and to numerically approximate integrals 
over the latent variables and the hyperparameters. Instead of implementing a Markov chain 
sampler directly for the Student-t model a more common approach is to use the Gibbs 
sampling based on the following scale mixture representation of the Student-t distribution 

yi\fi-M{h,Vi) 

Vi^lnv-x\y,(J^), (3) 



wher e each observation has its own Inv-x -distributed noise variance Vi (iNeal 119971 : iGelman et al 



20041 ) . Sampling of the hyperparameters 9 can be done with any genera ,l sampling algorithrn 



such as the Slice sampling or the hybrid Monte Carlo (HMC) (see, e.g.. lGelman et al.l . l2004l ). 
The Gibbs sampler on the scale mixture ([3]) converges often slowly and may get stuck for 
long times in small values of <t^ because of the dependence between Vi and cr^. This can 
be avoided by re-par ameterization Vi = a'^Ui, where Ui ~ Inv-x^(j^, t^), and loga^ oc 1 
dCelman et all . l2004l ). This improves mixing of the chains and reduces the autocorrelations 
but introduces an implicit prior for the scale parameter cj^ = a'^r'^ of the Student-t model. 

3.2 Laplace Approximation 

The Laplace approximation for the conditional posterior of the latent function is constructed 
from the second order Taylor expansion of logp(f | P, 9, cr^, u) around the mode f, which gives 
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a Gaussian approximation to the conditional posterior 

p(f I V, 9, a\ v) ^ q{i \V, 9, a\ v) = M{i |f , Sla), (4) 

where f = arg maXf p(f 9, a"^, v) ( Rasmussen and Winian3 . 20061 ) . is the Hessian of 

the negative log conditional posterior at the mode, that is, 

-VV logp(f \V, e, a\ v)L, = K-fif +W, (5) 



where W is a diagonal matrix with entries W,, = ^ \ogp{y\fi,a'^ ,v)\ 

The inference in the hyperparameters is conducted by doing a Laplace approximation to 
the marginal likelihood p(y|X, ^, cj^, z^) and searching for the maximum a posterior (MAP) 
estimate for the hyperparameters 

{0, o"^, z>} = arg max log q{9, a"^, = arg max [log g(y|X, 9, cr^, v) + \ogp{9, cr^, v)\ , 



where p{9,a'^,v) is the prior of the hyperparameters. The gradients of the approximate 
log marginal likelihood can be solved analytically, which enables the MAP estimation of the 
hyper parameters with gradient based optimization methods. Following IWilliams and Barber 
(|l998h i;he approximation scheme is called the Laplace method , but essentially the same 
approach is named Gaussian approximation by Rue et al. ( 20091 ) in their Integrated nested 
Laplace approxirnation (INLA) so ftware package for Gaussian Markov random field models 
( Vanhatalo et al. . 20091 ). (see also Tierney and Kadane . 19861 ). 

The implementation of the Laplace algorithm for this particular model requires care since 
the Student-t likelihood is not log-concave and thus p({ I'D , 6 , , v) may b e multimodal. 
The standard implementation presented bv iRasmussen and Williams j2006h requires some 
modifications which are discussed in detail by Vanhatalo et al.l ~ 2009l V Later on Hannes 
Nickisch proposed a sligthly different implementation (personal communication) , wh ich is 
at the moment in use in th; GPML software package asmussen and Mckiscd . B . 



3.3 Factorizing variational approximation (fVB) 

The scale-mixture decomposition ^ enables a computationally convenient variational ap- 
proximation if the latent values f and the residual variance terms V = [Vi,...,Vn 
assumed a posteriori independent: 



are 



qi{,V)=q{{)llq{Vi). 



(6) 



i=l 



This k ind of factorizing variational approximation was introduced bv lTipping and Lawrence 
jiool) ■ io form a robust observation model for linear models within the relevance vector ma- 
chine framewor k. For robust Gaussian process regression with the Student-t likelihood it 
applied by IKussI (|2006l ) and essentially the same variational approach has also been 



was 



used for approxima te inference on linear inodels with the automatic relevance determina- 
tion prior (see e.g.. Tipping and Lawrencel . boOSi ) . Assuming the factorizing posterior ([6]) 
and minimizing the KL-divergence from g(f, V) to the true posterior p{i,\'\T),6,a'^,u) re- 
sults in a Gaussian approximation for the latent values, and inverse-x^ (or equivalently 
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inverse gamma) approximations for the residual variances Vi. The parameters of q{i) and 
q{Vi) can be estimated by maximizing a variational lower bound for the marginal likelihood 
p(y \X, 6, £7^, u) with an expectation maximization (EM) algorithm. In the E-step of the al- 
gorithm the lower bound is maximized with respect to g(f) and q{Vi) given the current point 
estimate of the hyperparameters and in the M-step a new estimate of the hyperparameters 
is determined with fixed g(f) and q{Vi)- 

The drawback with a factorizing approximation determined by minimizing the reverse 



KL-d iv ergence is that it t ends t o underestimate the posterior uncertainties (see e.g.. iBishopl . 



2nOfil ). IVanhatalo eTID (120091 ) compared fVB with the previously described Laplace and 



MCMC approximations, and found that the fVB provided worse predictive variance esti- 
mates compared to the Laplace approximation. In addition, the estimation of v based on 
maximizing the variational lower bound was found less robust with the fVB. 

3.4 Variational bounds (VB) 



This v ariational bounding method was introduced for binary GP classification by lGibbs and MacKay 



(120001) and comparisons to other ap proximative methods for GP classification can be found 



in (iNickisch and Rasmussenl . l2008l ) . The method is based on forming a Gaussian lower 



bound for each likelihood term independently: 

piVilf,) > exp(-/2/(270 + bdi - h{j,)/2), 

which can be used to construct a lower bound on the marginal likelihood: p(y | X, 6, v, a) > 
ZyB- With fixed hyperparameters, 7^ and bi can be determined by maximizing Zvb to obtain 
a Gaussian approximation for the latent values p(f \ P, 9, o"^) and an approximation for the 
marginal likelihood. With the Student-t likelihood only the scale parameters 7j need to be 
optimized because the location parameter is determined by the corresponding observations: 
hi = Vi/^i- Similarly to the Laplace approximation and EP, MAP-estimation of the hyperpa- 
rameters can be done by optimizing Zvb with gradient bas ed methods. In our exper i ments 
we used the implementation available in the GPML-package (jRasmussen and Nickisch . l2oiol ;i 



augmented with the same hyperprior definitions as with the other approximative methods. 
3.5 Expectation Propagation 

The EP algorithm is a general me thod for approximating integrals over functions that factor 
into simple terms ( Minka . 200 Ibl ). It approximates the conditional posterior with 



1 " 
^EP 

where Zep « p(y | X, 0, a^, z^), S = (Kjif + S^V, = 5^ A, ^ = diagfaf, a^]^ 
and fi = [/ii, ...,/!„] In ([7|) the likelihood terms p{yi\fi,a'^ ,v) are approximated by un- 
normalized Gaussian site functions ii{fi\Zi, (li,af ) = ZiM{fi\fli,af). 

The EP algorithm updates the site parameters Zj, /ij and af and the posterior approx- 
imation ([7]) sequentially. At each iteration (i), first the i'th site is removed from the i'th 
marginal posterior to obtain a cavity distribution 

q.i.{fi)^q{n\V,e,a^u)ii{n)~\ (8) 
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Then the z'th site is replaced with the exact likelihood term to form a tilted distribution 
Piifi) = ^j^^Q-i{fi)p{yi\fi) which is a more refined non-Gaussian approximation to the 
true z'th marginal distribution. Next the algorithm attempts to match the approximative 
posterior marginal q{fi) with Pi{fi) by finding first a Gaussian qi{fi) satisfying 

Wi)=^{fi\(^h^f) = argminKL(pi(/i)||gi(/i)) , (9) 

which is equivalent to matching fli and af with the mean and variance of Then the 

parameters of the local approximation ti are updated so that the moments of q{fi) match 
with q{fi): 

q{h\V,e,a\u) cc = ZiM{h\h,d^h- (10) 

Finally, the parameters and 5] of the approximate posterior ([7]) are updated according to 
the changes in site tj. These steps are repeated for all the sites at some order until conver- 
gence. Since only the means and variances are needed in the Gaussian moment matching 
only Jli and df need to be updated during the iterations. The normalization terms Zi are 
required for the marginal likelihood approximation Zep ~ p(y | X, ^, o"^, z^) which is com- 
puted after converge of the algorithm, and they can be determined by integrating over /, in 
equation ([lO]) which gives Zi = Zi{J q_i{fi)M{fi\Jli,df)dfi)~'^. 

In the traditional EP algorithm (from now on referred to as the sequential EP), the pos- 
terior approximation ([7]) is updated sequentially after each moment matching (llOp . Recently 
an alternative parallel update scheme has been u tilized especially in models with a very large 



number of unknowns (see e.g. . Ivan Gerven et al.l ([2009)). In the parallel EP the site updates 



are calculated with fixed posterior marginals ^ and diag(5]) for all ti, i = 1, ...,n, in paral- 
lel, and the posterior approximation is refreshed only after all the sites have been updated. 
Although the theoretical cost for one sweep over the sites is the same {0{n^)) for both the 
sequential and the parallel EP, in practice one re-computation of 5] using Cholesky decom- 
position is much more efficient than n sequential rank-one updates. In our experiments, the 
number of sweeps required for convergence was roughly the same for both schemes in easier 
cases where the standard EP converges. 

The marginal likelihood approximation is given by 

1 1 -1 " 

logZEP = --log|Kf,f + S|--/iT(|Kf,f + s) iL + Y,^ogZi{cj\v) + C^Y>, (11) 

i=l 

where Cep = — ^log(27r) — log J q-i{fi) J\f{fi\jli, d'f)dfi collects terms that are not ex- 
plicit functions of 6, or v. If the algorithm has converged, that is, Pi{fi) is consistent 
(has the same means and variances) with q{fi) for all sites, Cep, S and /i c an be consid' 



ered constants when differ entiating (jlip with respect to the hyperparameters (jSeegerj, l2005l : 



Opper and Wintheii . l2005l ) . This enables efficient MAP estimation with gradient based op- 
timization methods. 

There is no guarantee of convergence for either the sequential or the parallel EP. When 
the likelihood terms are log-concave and the ap proximation is initialized to th e prior, the 
algorithm converges fine in many cases (see e.g., Nickisch and Rasmussen . 20081 ). However, 



with a non-log-concave likelihood such as the Student-t model, convergence problems may 
arise and these will be discussed in section [S] The convergence can be improved either by 
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damping the EP updates (IMinka and Laffertvl . |2002| ) or by using a robust but slower double- 
loop algorithm ([Heskes and Zoeteii . |2002| ) . In damping the site parameters in their natural 
exponential forms, fj = a~ and Di = p,i, are updated to a convex combination of the 
old and proposed new values, which results in the following update rules: 



Afi = 6{a-^ - a- and Ai>i = S{a- fii - a- m), 



-2 



;i-2 ; 



(12) 



where 5 G (0, 1] is a step size parameter controlling the amount of damping. Damping can 
be viewed as using a smaller step size within a gradient-based sea rch for saddle points of th e 
same objective function as is used in the double- loop algorithm (jHeskes and Zoeteii . |2002| ) . 



3.6 Expectation Propagation, the double-loop algorithm 

When either the sequential or the parallel EP does not converge one may still find approx- 
imations satisfying the moni e nt ma tching conditions (llOp by a double loop algorithm. For 
example, iHeskes and Zoeten (j2002l ) present simulation results with linear dynamical sys- 
tems where the double loop algorithm is able to find more accurate approximations when 
the damped EP fails to converge. For the model under consideration, the fixed points of 
the EP algorit hm correspond to the stationary points of the following objective function 
dMinkal . lioOl^ 



rnininax- ^ log J p{yi\fi) exp (^u_ifi - T-i dfi - log J p(f) fl exp (^i>ifi - r^y j df 
+ log J exp (^Usji -Ts^Y^ dfi 



(13) 



where A_ = {z^_j,r_j}, A = {i>j,fj}, and Ag = {i^Si,Ts^} are the natural parameters of 
the cavity distributions q-i{fi), the site approximations ti{fi), and approximate marginal 
distributions qs^ifi) = M{t~^Vs^,t~/^) respectively. The min-max problem needs to be solved 
subject to the constraints i>j = Ug^ — V-i and fj = r^. — r_j, which resemble the moment 
matching conditions in (llOp . The objective function in (I13p is equal to the log^Ep defined 
in ([7]) and is also equivalent to the exp ectation consistent (EC) free energy approximation 
presented bv lOpper and Wintheii (j2005l ) . A unifying view of the EC and EP approximat ions 
as well as the connection to the Bethe free energies is presented by Heskes et al. ( 20051 ). 

Equation (jl3p suggests a double-loop algorithm where the inner loop consist of maxi- 
mization with respect to A_ with fixed Ag and the outer loop of minimization with respect 
to Xg . The inner maximization affects only the first two terms and ensures that the marginal 
moments of the current posterior approximation q(f ) are equal to the moments of the tilted 
distributions p{fi) for fixed A^. The outer minimization ensures that the moments qsXfi) 
are equal to marginal moments of (?(f). At the convergence, q{fi), p{fi), and qs^ifi) share 
the same moments up to the second order. If p{yi\fi) are bounded, the objective is bounded 
from below and conseq uently t here exists stationary poin t s sat isfying these expectation 
consistency constraints ( Minka . 2001a : Opper and Winther . 20051 ). In the case of multiple 
stationary points the solution with the smallest free energy can be chosen. 

Since the first two terms in (I13p are concave functions of A_ and A the inner maximization 
problem is concave with respect to A_ (or equivalently A) after substitution of the constraints 



8 



Gaussian Process Regression with a SxuDENx-i Likelihood 



A = A^^ - A_ (lOpper and Wintheil . lioosh . The Hessian of the first term with respect to A_ is 
weh defined (and negative semi-definite) only if the tilted distributions p{fi) oc p{yi\fi)q-i{fi) 
are proper probability distributions with finite moments up to the fourth order. Therefore, 
to ensure that the product of q-i{fi) and the Student-t site p{yi\fi) has finite moments and 
that the inner-loop moment matching remains meaningful, the cavity precisions T-i have to 
be kept positive. Furthermore, since the cavity distributions can be regarded as estimates 
for the leave-one-out (LOO) distributions of the latent values, r_j = would correspond to 
a situation where g(/i|y_j,X) has infinite variance, which does not make sense given the 
Gaussian prior assumption ([T]). On the other hand, fj may become negative for example 
when the corresponding observation yi is an outlier (see section [5]). 



3.7 Fractional EP updates 

Fractional EP (or power EP. lMinkal . [20041 ) is an extension of EP which can be used to reduce 



the computational complexity of the algorithm by simplifying the tilted moment evaluations 
and to ir nprove the ro bustness of the algorithm when the approximation family is not fiexible 
enough (IMinkal. l2005l) or w hen the propagation of information is difficult due to vague prior 
information ( Seegeil . 20081 ) . In the fractional EP the cavity distributions are defined as 



q-i{fi) oc q{fi\'D,9,i^,a'^)/ii{fi)'' and the tilted distribution as Pi{fi) oc q~i{fi)p{yi\fi)'^ for 
a fraction parameter r/ G (0,1]- The site parameters are updated so that the moments 
of q-i{fi)ti{fi)^ oc q{fi) match with q-i{fi)p{yi\fi)^ ■ Otherwise the procedure is similar 
and the standard EP can be recovered by setting rj = 1. In the fractional EP the natural 
parameters of the cavity distribution are given by 



T-i = (Ji — rjfi and v. 
and the site updates (with damping factor 5) by 



Afi = 5r] ^((Tj ^ - ^) and APj = 5r] ^{a- '^jli - a- "^fii 



(14) 



(15) 



The fractional update step miug KL(j5j(/j)| |(7(/j)) can be v iewed as mini mization of an- 
other divergence measure called the a-divergence with a = r] (|Minkal . [2005h . Compared to 
the KL-divergence, minimizing the a-divergence with < a < 1 does not force q{fi) to cover 
as much of the probability mass of p{fi) whenever p{fi) > 0. As a consequence, the fractional 
EP tends to underestimate the variance and normalization constant of q-i{fi)p{yi\fi)^ , and 
also the approximate marginal likelihood Zep- On the other hand, we also found that min- 
imizing the KL-divergence in the standard EP may overestimate the marginal likelihood 
with some data sets. In case of multiple modes, the approximation tries to represent the 
overall uncertainty in Pi{fi) the more exactly the closer a is to 1. In the limit a — )• the 
reverse KL-divergen ce is obtained which is used in s ome form, for example, in the fVB and 
KL approximations ( Nickisch and Rasmussen . 20081 ). Also the double- loop objective func- 
tion (fT3l) can be modified according to the diff e rent d ivergence measure of the fractional EP 
(|Cseke and Heskesl . I2OI1I : ISeeger and Nickischl . [20111 ). 

The fractional EP has some benefits over the standard EP with the non-log-concave 
Student-f sites. First, when evaluating the moments of g_j(/j)p(yj|/j)'', setting r] <\ flattens 
the likelihood term which alleviates the possible converge problems related to multimodality. 
This is related to the approximating family being too inflexible and the beneflts of different 
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divergence measures in these cases are considered bv iMinkal (j2005l ). Second, the fractional 
updates help to avoid the cavity precisions becoming too small, or even negative. By choosing 
rj < 1, a fraction (1 — r/)fj of the precision of the i:th. site is left in the cavity. This decreases 
the cavity variances which in turn makes the tilted moment integrations numerically more 
robust. Problems related to cavity precision becoming too small cai i be pr e sent a lso with 
log-concave sites when the prior information is vague. For example, ISeeger (|2008h reports 
that with an underdetermined linear model combined with a log-concave Laplace prior the 
cavity precisions remain positive but they may become very small which induces numerical 
inaccuracies in the analytical mome nt eval u ations . These inaccuracies may accumulate and 
even cause convergence problems. ISeeger (|2008h reports that fractional updates improve 
numerical robustness and convergence in such cases. 



4. Robust implementation of the parallel EP algorithm 

The sequential EP updates are shown to be stable for models i n which the e xact site terms 
(in our case the likelihood functions p{yi\fi)) are log-concave (ISeeged . l2008l ) . In this case, 
all site variances, if initialized to non-negative values, remain non-negative during the up- 
dates. It follows that the variances of the cavity distributions are positive and 
thus also the subsequent moment evaluations of Q-i{fi)piyi\fi) are numerically robust. The 
non-log-concave Student-i likelihood is problematic because both the conditional posterior 
p(f I P, 0, z^, o") as well as the tilted distributions Pi{fi) may become multimodal. Therefore 
extra care is needed in the implementation and these issues are discussed in this section. 

The double-loop algorithm is a rigorous approach that is guaranteed to converge to a 
stationary point of the objective function (|13p when the site terms p{yi\fi) are bounded from 
below. The downside is that the double-loop algorithm can be much slower than for example 
the parallel EP because it spends much computational effort during the inner loop iterations, 
especially in the early stages when Qsiifi) are poor approximations for the true marginals. 
An obvious improvement would be to start with damped parallel updates and to continue 
with the double-loop method if necessary. Since in our experiments the parallel EP has 
proven quite efficient with many easier data sets, we adopt this approach and propose few 
modifications to improve the conver gence in difficult case s . A p arallel EP initialization and 
a double-loop backup is also used by lSeeger and NickischI (1201 ll ) in their fast EP algorithm. 

The parallel EP can also be interpreted as a variant of the double-loop algorithm where 
only one inner-loop optimization step is done by moment matching (llOp and each such update 
is followed by an outer-loop refinement of the marginal approximations Qsiifi)- The inner- 
loop step consists of evaluating the tilted moments {/ij, af\i = 1, n} with Qsiifi) = q{fi) = 



J\f{tii,'Sii), updating the sites p2|) . and updating the posterior ([7]). The outer-loop step 
consists of setting Qsiifi) equal to the new marginal distributions q{fi)- Connections between 
the message passing updates and the double-loop methods together with considerations of 
different s earch directions for the inner- loop optimization can be found in the extended 
version of ( Heskes and Zoeter . 20021 ). The robustness of the parallel EP can be improved by 
the following modifications. 



1. After each moment matching step check that the objective (|13p increases. If the 
objective does not increase reduce the damping coefficient 5 until increase is obtained. 
The downside is that this requires one additional evaluation of the tilted moments 
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for every site per iteration, but if these one-dimensional integrals are implemented 
efficiently this is a reasonable price for stability. 

2. Before updating the sites (jl2p check that the new cavity variances r_j = r^. — (fj + Afj) 
are positive. If they are negative, choose a smaller damping factor 6 so that T_j > 0. 
This computationally cheap precaution ensures that the increase of the objective (|13p 
can be verified according to the modification 1. 

3. With modifications 1 and 2 the site parameters can still oscillate (see section [5] for 
an illustration) but according to our experiments the convergence is obtained with all 
hyperparameters values eventually. The oscillations can be reduced by updating qg- (fi) 
only after the moments oi p{fi) and q{fi) are consistent for all « = 1, ...,n with some 
small tolerance, for example 10~^. Actually, this modification results in a double- loop 
algorithm where the inner- loop optimization is done by moment matching (jlOp . If no 
parallel initialization is done, often during the first 5-10 iterations when the step size 
5 is limited according to the modification 2, the consistency between p{fi) and q{fi) 
cannot be achieved. This is an indication of g(f) being too inflexible for the tilted 
distributions with the current Qsiifi)- An outer-loop update qsiifi) = <l{fi) usually 
helps in these cases. 

4. If no increase of the objective is achieved after an inner-loop update (modiflcation 1), 
utilize the gradient information to obtain a better step size 5. The gradients of (|13p 
with respect to the site parameters Vi and can be calculated without additional 
evaluations of the objective function. With these gradients, it is possible to determine 
the gradient of the objective function with respect to 5 in the current search direction 
defined by (jl2p . For example, cubic interpolation with derivative constraints at the 
end points can be used to approximate the objective as a function of 5 with fixed site 
updates Af, = — (T~^ and AzJ, = <T~'^fii — o'^'^Hi for i = 1, ...,n, from which a 
better estimate for the step size 6 can be determined efficiently. 

In the comparisons of section [6] we start with 10 damped (6 = 0.8) parallel iterations 
because with a sensible hyperparameter initialization this is enough to achieve convergence 
in most hyperparameter optimization steps with the empirical data sets. If no convergence 
is achieved this parallel initialization also speeds up the convergence of the subsequent 
double-loop iterations (see section [512]) . If after any of the initial parallel updates the pos- 
terior covariance S becomes ill-conditioned, i.e., many of the fj are too negative, or any 
of the cavity variances become negative we reject the new site configuration and proceed 
with more robust updates utilizing the previously described modifications. To reduce the 
computational costs we limited the maximum number of inner loop iterations (modifica- 
tion 3) to two with two possible additional step size adjustment iterations (modification 4). 
This may not be enough to suppress all oscillations of the site parameters but in practice 
more frequent outer loop refinements of Qsiifi) were found to require fewer computationally 
expensive objective evaluations for convergence. 

In some rare cases, for example when the noise level a is very small, the outer-loop 
update of Qsiifi) may result in negative values for some of the cavity variances even though 
the inner-loop optimality is satisfied. In practise this means that is smaller than fj for 

some i. This may be a numerical problem or an indication of a too inflexible approximating 
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family but switching to fractional updates helps. However, in our experiments, this happened 
only when the noise level was set to too small values and with a sensible initialization such 
problems did not emerge. 

4.1 Other implementation details 

The EP updates require evaluation of moments = f ffgi{fi)dfi for k = 0,1,2, where 
we have defined gi{fi) = Q~i{fi)piyi\fi)^ ■ With the Student-t likelihood and an arbitrary 
rj £ (0, 1] numeric al integrat i on is r equired. We used the adaptive Gauss-Kronrod quadra- 



ture described by IShampind (j2008l ) and for computational savings calculated the required 
moments simultaneously using the same function evaluations. The integrand Pi{fi) may have 
one or two modes between the cavity mean and the observation y^. In the two- modal 
case the first mode is near and the other near /Uqo = '^^(cZj M_j -|- r]ia~'^yi), where 
and o"^ = {cZf + rjia~'^)~^ correspond to the mean and variance of the limiting Gaussian 
tilted distribution as — )• oo. The integration limits were set to min(/^_i — 6c7_j, /Xqo — 6(7 oo) 
and max(/i_j + 6(T_i,^oo + Scqo) to cover all the relevant mass around the both possible 
modes. 

Both the hyperparameter estimation and monitoring the convergence of the EP requires 
that the marginal likelihood q{y \ X, 9, o"^, can be evaluated in a numerically robust man- 
ner. Assuming a fraction parameter rj the marginal likelihood is given by 

I ^ / 1 1 1 \ 

log Zep =- ^ ( log + - log t.^tzI + - -T-'ul 

-ilog|I + Kf,fE~'|-^i>^M, (16) 

where i/^. = v^i -|- r]Vi and Tg. = T_j -|- r/fj. The first sum term can be evaluated safely if 
the cavity precisions r_j and the tilted variances o"? remain positive during the EP updates 
because at convergence r^- = 

Evaluation of 1 1 -|- Kf^f S ^ | and S = (Kf^f^^ -|- 1] "'^)^^ needs some care because many 

of the diagonal entries of 1] ^ = diagf ri , fn] may become neg a tive d ue to outliers and thus 



the standard approach presented by iRasmussen and Williamsl (120061) is not suitable. O ne 
option is to use the rank one Cholesky updates as described by Vanhatalo et al. ( 20091 ) or 



the LU decomposition as is done in the GPML implementation of the Laplace approximation 
( Rasmussen and Nickisch . 2O10l ). In our parallel EP implementation we process the positive 
and negative sites separately. We define Wi = diag(f^^^^) for fj > and W2 = diag(|fj|^/^) 
for fj < 0, and divide Kf f into corresponding blocks Kn, K22, and K12 = KJ^. We compute 
the Cholesky decompositions of two symmetric matrices 

LiLJ" = I-FWiKiiWi and L2LJ = I -W2(K22 - U2Uj)W2, 

where U2 = K2iWiLj~^. The required determinant is given by 1 1 -|- Kf^f S ^ | = |Li|^|L2|^. 
The dimension of Li is typically much larger than that of L2 and it is always positive 
definite. L2 may not be positive definite if the site precisions are too negative, and therefore 
if the second Cholesky decomposition fails after a parallel EP update we reject the proposed 
site parameters and reduce the step size. The posterior covariance can be evaluated as 
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S = Kf,f -UU'^ + VV'^, where U = [Kn, K12] ^ WiLj"'^ and V = [K12, K22] ^ W2L^'^ - 
UUj W2L2 The regular observations reduce the posterior uncertainty through U and 
the outhers increase uncertainty through V. 

5. Properties of the EP with a Student-t hkehhood 

In GP regression the outher rejection property of the Student-t hkehhood depends heavily 
on the data and the hyperparameters. If the hyperparameters and the resulting unimodal 
approximation ([7]) are suitable for the data there are usually only a few outliers and there 
is enough information to handle them given the smoothness assumptions of the GP prior 
and the regular observations. This is usually the case during the MAP estimation if the 
hyperparameters are initialized sensibly. Unsuitable hyperparameters values, for example a 
small 1/ combined with a too small a and a too large lengthscale, can result into a very large 
number of outliers because the model is unable to explain large quantity of the observations. 
This may not necessarily induce convergence problems for the EP if there exists only one 
plausible posterior hypothesis capable of handling the outliers. On the other hand, if the 
conditional posterior distribution has multiple modes convergence problems may occur unless 
sufficient amount of damping is used. In such cases the approximating family ([7]) may not 
be fl exible enough but different divergence measures (fractional updates with 77 < 1) can 
help (jMinka . In this section we discuss the convergence properties of the EP and also 



compare its approximation to the other methods described in the section 3. 

An outlying observation yi increases the posterior uncertainty on the unknown function 
at the input space regions a priori correlated with Xj. The amount of increase depends 
on how far the posterior mean estimate of the unknown function value, E(/i|P), is from 
the observation yi. Some insight into this behavior is obtained by considering the negative 
Hessian of logp{yi\fi, v, o"^), i.e., Wi = — Vj, logp(yi|/i), as a function of /j (compare to the 
Laplace approximation in section 3.2). Wi is non- negative when yi — a\fii < fi < yi + 
attains its minimum when fi = yii: a^/Sv and approaches zero as — >• 00. Thus, with the 
Laplace approximation satisfying fi — a^/v < yi < fi + o^/v can be interpreted as regular 
observations because they decrease the posterior covariance Yl in equation ([5]). The rest 
of the observations increase the posterior uncertainty and can therefore be interpreted as 
outliers. Observations that are far from the mode fi are clear outliers in the sense that they 
have very little effect on the posterior uncertainty. Observations that are close to fi it aV3i> 
are not clearly outlying because they increase the posterior uncertainty the most. The most 
problematic situations arise when the hyperparameters are such that many fi are close to 
yi^a^/Su. However, despite the negative Wjj, the covariance matrix Sla is positive definite 
if f is a maximum of the conditional posterior. 

The EP behaves similarly as well. If there is a disagreement between the cavity distri- 
bution g_j(/j) = J\f{fi-i,(j'^i) and the likelihood p{yi\fi) but the observation is not a clear 
outlier, the tilted distribution is two-modal and the moment matching (jlOp results in an 
increase of the marginal posterior variance, af > af, which causes fj to decrease (I12p and 
possibly become negative. The sequential EP usually runs smoothly when there's a unique 
posterior mode and clear outliers. The site precisions corresponding to the outlying obser- 
vations may become negative but their absolute values remain small compared to the site 
precisions of the regular observations. 



13 



Pasi Jylanki, Jarno Vanhatalo and Aki Vehtari 




Figure 1: The upper row: Two one-dimensional regression examples, where the standard 
EP may fail to converge with certain hyperparameter values, unless damped suf- 
ficiently. The EP approximations obtained by both the regular updates rj = 1 
(EP) and the fractional updates rj = 0.5 (fEP) are visualized. The lower row: 
comparison of the approximative predictive distributions of the latent value /(x) 
at x = 2. With MCMC all the hyperparameters are sampled and for all the 
other approximations (except fVB in example 2, see the text for explanation) the 
hyperparameters are fixed to the corresponding MAP estimates. Notice that the 
MCMC estimate of the predictive distribution is unimodal in example 1 and mul- 
timodal in example 2. With smaller lengthscale values the conditional posterior 
p(f I T>, 9) can be multimodal also in example 1. 



5.1 Simple regression examples 

Figure [T] shows two one-dimensional regression problems in which the standard EP may run 
into problems. In example 1 (the upper left panel) there are two outliers yi and y2 providing 
conflicting information in a region with no regular observations (1 < x < 3). If and cj^ 
are sufficiently small, so that the likelihood p{yi\fi) is narrow as a function of fi, and the 
length-scale is small inducing small correlation between inputs far apart, there is significant 
posterior uncertainty about the unknown f{x) when 1 < x < 3 and the true posterior is 
multimodal. The Student-t distribution is able to reject up to m outliers locally, in some 
neighborhood of an input x, if there are at least 2m observations in that neighborhood. The 
size of the neighborhood depends on the smoothness properties of the GP prior (governed 
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by the length-scale). In example 1, we have two observations locally which both conflict 
with q{fi, f2\y3, ■■■,yn) almost equally much and neither one can be labeled as an outlier 
nor a regular observation. Contrary to the clearly outlying observations for which fj < 0, 
at convergence the site precisions fi and f2 (corresponding to yi and 2/2) get small positive 
values, that is, these observations decrease the posterior uncertainty despite being outliers. 
If the site updates are not damped enough fi or 72 may become negative and cause stability 
problems. 

If the length-scale is sufficiently large in the example 1, the GP prior forces the posterior 
distribution to become unimodal and both the sequential and the parallel EP converge. 
This is also the case when the hyperparameters are fixed to their MAP estimates (assuming 
noninformative priors). The corresponding predictive distribution is visualized in the upper 
left panel of Figure [T] showing a considerable increase in the posterior uncertainty when 1 < 
X < 3. The lower left panel shows comparison of the predictive distribution of /(x) at x = 2 
obtained with the different approximations described in the section 3. The hyperparameters 
are estimated separately for each method. The smooth MCMC estimate is calculated by 
integrating analytically over the latent vector f for each posterior draw of the residual 
variances V and averaging the resulting Gaussian distributions g(/|x,V,0). The MCMC 
estimate (with integration over the hyperparameters) is unimodal but shows small side 
bumps when the latent function value is close to the observations yi and y2- The standard 
EP estimate (EPl) covers well the posterior uncertainty on the latent value but both the 
Laplace and fVB underestimate it. At the other input locations where the uncertainty is 
small, all methods give very similar estimates. 

The second one-dimensional regression example, visualized in the upper right panel of 
Figure [H is otherwise similar to the example 1 except that the nonlinearity of true function 
is much stronger when — 5 < x < 0, and the observations yi and 1/2 are closer in the input 
space. The stronger nonlinearity requires a much smaller length-scale for a good data fit and 
the outliers yi and 7/2 provide more confiicting information (and stronger multimodality) due 
to the larger prior covariance. The lower right panel shows comparison of the approximative 
predictive distributions of /(x) when x = 2. The MCMC estimate has two separate modes 
near the observations yi and 7/2- The Laplace and fVB approximations are sharply localized 
at the mode near yi but the standard EP approximation (EPl) is very wide trying to preserve 
the uncertainty about the both modes. Contrary to the example 1, also the conditional 
posterior q{f \ 'D,6,i/,a) is two- modal if the hyperparameters are set to their MAP-estimates. 

Next we discuss the problems with the standard EP updates with the help of the example 
1. Figure [2] illustrates a two-dimensional tilted distribution of the latent values fi and /2 
related to the observations yi and 1/2 in the example 1. A relatively small lengthscale 
(0.9) is chosen so that there is large uncertainty on fi and /2, but still quite strong prior 
correlation between them. Suppose that all other sites have already been updated once with 
the undamped sequential EP starting from a zero initialization (fj = and z>j = for i = 
1, n). Panel (a) visualizes the 2-dimensional marginal approximation q{fi, /2|y3, • • • , yn) 
together with the joint likelihood p{yi,y2\fi, f2) = P{y2\f2)p{y2\f2), and panel [(b)] shows the 
contours of the resulting two dimensional tilted distribution which has two separate modes. 
If the site ti(/i) is updated next in the sequential manner with no damping, fi will get a 
large positive value and the approximation q{fi,f2) fits tightly around the mode near the 
observation yi. After this, when the site t2(/2) is updated, it gets a large negative precision. 
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(a) q(fij2\ys,-,yn) (b) q{h, h\yi, -^yn) (c) q{fi, h\ys, ■■■,yn) and (d) q{fi, f2\y3, ...,yn) 
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Figure 2: An illustration of a two-dimensional tilted distribution related to the two prob- 
lematic data points yi and y2 in the example 1. Compared to the MAP value 
used in the Figure [U shorter lengthscale (0.9) is selected so that the true 
conditional posterior is multimodal. Panel (a) visualizes the joint likelihood 
P{yi\fi)p{y2\f2) together with the approximate marginal q{fi, f2\y3, ■■■,yn) ob- 
tained by one round of undamped sequential EP updates on sites ti{fi), for 
i = 3, ...,n. Panel (b) visualizes the corresponding two-dimensional tilted dis- 
tribution p(/i,/2) oc q{fi, f2\y3, ■■■,yn)piyi\fi)p{y2\f2)- Panels (c) and (d) show 
the same with only a fraction rj = 0.5 of the likelihood terms included in the tilted 
distribution, which corresponds to fractional EP updates on these sites. 



f2 < 0, since the approximation needs to be expanded toward observation y2 which is not at 
this stage classified as a clear outlier because of the vague prior information. It follows that 
during the second EP sweep site 1 can no longer be updated because the cavity precision 
r_i = a^'^ — fi is negative. This happens because there are no other data points supporting 
the current posterior solution, that is, there are no other sites with positive precision nearby, 
and the site 2 with negative precision reduces the current marginal precision cjj^^ too much. 
If the EP updates were done in parallel, both the cavity and the site precisions would be 
positive after the first posterior update, but q{fi, /2) would be tightly centered between the 
modes. After a couple of parallel loops over all sites, negative cavity variances can emerge as 
one of the problematic sites gets a too small negative precision because the approximation 
needs to be expanded to cover all the marginal uncertainty in the tilted distributions. 

Skipping updates on the sites with negative cavity v ariances can keep the algo rithm 
numerically stable (for an example of skipping updates see Minka and Laffertv ( 20021 )). but 



it is not enough to ensure convergence. In the Figure [21 the large posterior uncertainty 
about /i and /2 requires very small positive precisions fj for sites 1 and 2. On the other 
hand, large differences between the tilted and marginal variances induce large changes on 
these small site precisions. If the EP updates are not constrained the site parameters may 
start oscillating between too small and too large values and the algorithm never converges 
because of too large update steps. One way to reduce the step length is to use damped 
updates. Decreasing the damping factor 5 sufficiently reduces the oscillations so that the 
algorithm eventually converges but the convergence can be very slow. Example 2 is more 
difficult in the sense that convergence requires damping at least with 6 = 0.5. With the 
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sequential EP the convergence depends also on the update order of the sites and 6 < 0.3 is 
needed for convergence with all permutations. 



Figures 2(c) - (d) illustrate the same approximate tilted distribution as Figures 2(a) -(b) 



but now only a fraction rj = 0.5 of the likelihood terms are included. This corresponds to the 
first round fractional updates on these sites with zero initialization. Because of the flattened 
likelihood p{yi\fi)^p{y2\f2)^ the approximate posterior distribution q{fi, /2|y3, Vn) is still 
two- modal but less sharply peaked compared to the standard EP on the left. It follows 
that also the one-dimensional tilted distributions have smaller variances and the consecutive 
fractional updates (fTSl) of sites 1 and 2 do not widen the marginal variances af and (t| as 
much. The site precisions fi and f2 are larger than with the regular EP and updates on 
them smaller which requires less damping to keep them positive. This is possible because 
the different divergence measure allows for a more localized approximation at 1 < x < 3. 
Figure [T] shows a comparison of the standard (EP) and the fractional EP (fEP, ry = 0.5) with 
MAP estimates of the hyperparameters. In the first example both methods produce very 
similar predictive distribution because the posterior is unimodal. In the second example 
(lower right panel) the fractional EP gives a much smaller predictive uncertainty estimate 
when X = 2 than the standard EP which in turn puts more false posterior mass in the tails 
when compared to the MCMC. 



5.2 Convergence comparisons 

Figure |3] illustrates the convergence properties of the different EP algorithms using the data 
from the example 2. The hyperparameters were set to: = 2, cr = 0.1, age = 3 and //; = 0.88. 
Panel (a) shows the negative marginal likelihood approximation during the first 100 sweeps 
with the sequential EP and the damping set to 6 = 0.8. The panel below shows the site 
precisions corresponding to the observations yi,...,y4 marked in the upper right panel of 
Figure [TJ With this damping level the site parameters keep oscillating with no convergence 
and there are also certain parameter values between iterations 50-60 where the marginal 
likelihood is not defined because of negative cavity precisions (the updates for such sites are 
skipped until next iteration). Whenever fi and f2 become too small they also inflict large 
decrease in the nearby sites 3 and 4. These fluctuations affect other sites the more the larger 
their prior correlations are (defined by the GP prior) with the sites 1 and 2. Panel (b) shows 
the same graphs with larger amount of damping 6 = 0.5. Now the oscillations gradually 
decrease as more iterations are done but convergence is still very slow. Panel (c) shows the 
corresponding data with the parallel EP and same amount of damping. The algorithm does 
not converge and the oscillations are much larger compared to the sequential EP. Also the 
marginal likelihood is not defined at many iterations because of negative cavity precisions. 

Panel (d) in Figure [3] illustrates the convergence of the double- loop algorithm with no 
parallel initialization. There are no oscillations present because the increase of the objec- 
tive (fT3l) is verified at every iteration and sufficient inner-loop optimality is obtained before 
proceeding with the outer-loop minimization. However, compared to the sequential or the 
parallel EP, the convergence is very slow and it takes over 100 iterations to get the site pa- 
rameters to the level that the sequential EP attains with only a couple of iterations. Panels 
(e) shows that much faster convergence can be obtained by initializing with 5 parallel itera- 
tions and then switching to the double-loop algorithm. There is still some slow drift visible 
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(a) Sequential, 5=0.8 (b) Sequential, 5=0.5 (c) Parallel, 6=0.5 




(d) Double-loop (e) 5 parallel + Double-loop (f) Parallel, ti=0.5 
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Figure 3: A convergence comparison between the sequential and parallel EP as well as the 
double-loop algorithm in the example 2 (the right panel in Figure [1]) . For each 
method both the objective — log Zep and the site precisions fj related to data 
points yi, ...,?/4 (see Figure [1]) are shown. See the text for explanation. 



in the site parameters after 20 iterations but changes in the marginal likelihood estimate 
are very small. Small changes in the site parameters indicate differences in the moment 
matching conditions (|10p and consequently also the gradient of the marginal likelihood es- 
timate may be slightly inaccurate if the implicit deri yatives of log Ze p w i th res pect to A_ 



and Xg are assumed zero in the gradient evaluations (lOpper and Wintheii . l2005l ) . Panel (f ) 
shows that the parallel EP converges without damping if fractional updates with r] = 0.5 are 
applied. Because of the different divergence measure the posterior approximation is more 
localized (see Figure [1]) . It follows that the site precisions related to yi and y2 are larger 
and no damping is required to keep them positive during the updates. 

5.3 The marginal likelihood approximation 

Figure m shows contours of the approximate log marginal likelihood with respect to log(/fc) 
and log((Tgg) in the examples of Figure [TJ The contours in the first column are obtained 
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Example 1, EP,ii=1, MAE=0.24 Example 1, EP,ii=0.5, MAE=0.25 Example 1, AIS 




Example 2, EP, 11=1.0, MAE=0. 24 Example 2, EP,ti=0.5, MAE.0.18 Example 2, AIS 




-0.5 0.5 -0.5 0.5 -0.5 0.5 



log(length-scale) 

Figure 4: The approximate log marginal likelihood logp(y | X, 9, v, o"^) as a function of the 
log-length-scale log(^^) and the log-magnitude log(o"^g) in the examples shown 
in the Figure [TJ The marginal likelihood approximation is visualized with both 
the standard EP [r] = 1) and the fractional EP [i] = 0.5). The mode of the 
hyperparameters is marked with x and o for the standard and the fractional 
EP respectively. For comparison the marginal is also approximated by annealed 
importance sampling (AIS). For both the standard and the fractional EP the mean 
absolute errors (MAE) over the region with respect to the AIS estimate are also 
shown. The noise parameter cr^ and the degrees of freedom u are fixed to the 
MAP-estimates obtained with r/ = 1. 



by applying first the sequential EP with 5 = 0.8 and using the double-loop algorithm if it 
does not converge. The hyperparameter values for which the sequential algorithm does not 
converge are marked with black dots and the maximum marginal likelihood estimate of the 
hyperparameters is marked with (x). The second column shows the corresponding results 
obtained with the fractional EP (r/ = 0.5) and the corresponding hyperparameter estimates 
are marked with (o). For comparison, lo g ma r ginal likelihood estimates determined with 
the annealed importance samphng (AIS) (B,B) are shown in the third column. 

In the both examples there is an area of problematic EP updates with smaller length- 
scales which corresponds to the previously discussed ambiguity about the unknown function 
near data points yi and y2 in the Figure [TJ There is also a second area of problematic 
updates at larger length-scale values in example 2. With larger length-scales the model 
is too stiff and it is unable to explain large proportion of the data points in the strongly 
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nonlinear region (—4 < a; < —1) and consequently there exist no unique unimodal solution. 
It is clear that with the first artificial example the optimization of the hyperparameters with 
the sequential EP can fail if not initialized carefully or not enough damping is used. In the 
second example the sequential EP approximation corresponding to the MAP values cannot 
even be evaluated because the mode lies in the area of nonconvergent hyperparameter values. 
In visual comparison with AIS both the standard and fractional EP give very similar and 
accurate approximations in the first example (the contours are drawn at the same levels 
for each method). In the second example there are more visible difi'erences: the standard 
EP tends to overestimate the marginal likelihood due to the larger posterior uncertainties 
(see Figure [1]) whereas fractional EP underestimates it slightly. This is congruent with the 
properties of the diff^erent divergence measure used in the moment matching. The difference 
between the hyperparameter values at the modes between the standard and fractional EP is 
otherwise less than 5% except that in the second example a and u are ca. 30% larger with 
the fractional EP. 



6. Experiments 



Fo ur data se t s are used to compare the approximative methods: 1) An artificial example 
by iFriedmanI (|l99ll ) involving a nonlinear function of 5 inputs. To create a feature selection 
problem, five irrelevant input variables were added to the data. We genera ted 10 data sets 
with 100 training points and 10 randomly selected outliers as described bv lKussI tOQ(h . 2) 
Boston housing data with 506 observations for which the task is to pre dict the med ian house 
prices in the Boston metropolitan area with 13 input variables (see e.g., Kuss, 20061 ). 3) Data 
that involves prediction of con crete quality based on 27 input variables for 215 experiments 
( Vehtari and Lampinen . 20021 ). 4) Data for which the task is to pr edict the co mpressive 
strength of concrete based on 8 input variables for 1030 observations ( Yeh . 19981 ). 



6.1 Predictive comparisons with fixed hyperparameters 

First we compare the quality of the approximate predictive distributions q{f^ | x,,,, P, 9, cj^), 
where x* is the prediction location and /* = /(x*), between all the approximative meth- 
ods. We run a full MCMC on the housing data to determine the posterior mean estimates 
for the hyperparameters. Then the hyperparameters were fixed to these values and a 10- 
fold cross-validation was done with all the approximations including MCMC. The predic- 
tive means and standard deviations of the latent values as well as the predictive densities 
of the test observations obtained with the Laplace's method (LA), EP, fVB, and VB are 
plotted against the MCMC estimate in the Figure [5l Excluding MCMC, the predictive 
densities were approximated by numerically integrating over the Gaussian approximation 
of /* in g(y*| x*,D, ^, I/, cr^) = j p{y^\f^,v,a'^)q{f^\yi^,V,6,i',a'^)df^. EP gives the most 
accurate estimates for all the predictive statistics, and clear differences to the MCMC can 
only be seen in the predictive densities which indicates that accurate mean and variance 
estimates of the latent value may not always be enough when deriving other predictive 
statistics. This contrast somewhat to the corresponding results in GP classification where 
Gaus sian approximation was shown to be very accurate in estimating predictive probabil- 
ities ( Nickisch and Rasmussen . 20081 ). Both fVB and VB approximate the mean well but 
are overconfident in the sense that they underestimate the standard deviations, overesti- 
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E(f.)forLA E(f.)forEP E(f.)forfVB E{f.)forVB 




p(y.)forLA p(y.) for EP p(y.)forfVB p(y.)forVB 




012 012 012 012 



Figure 5: A comparison of the approximative predictive means E(/*| x^:,T>), standard devi- 
ations std(/*| X*, P), and probabilities g(y*|x*,P) provided by the different ap- 
proximation methods using 10-fold cross-validation on the Boston housing data. 
The hyperparameters are fixed to the posterior means obtained by MCMC run 
on all data. Each dot corresponds to one data point for which the x-coordinate is 
the MCMC estimate with the fixed hyperparameter values and the y-coordinate 
the corresponding approximative value obtained with the Laplace's method (LA), 
EP, fVB, or VB. 



mate the larger predictive densities, and underestimate smaller predictive densities. LA 
gives similar mean estimates with the VB approximations but approximates the standard 
deviations slightly better especially with larger values. Put together, all methods provide 
decent estimates with fixed hyperparameters but larger performance differences are possible 
with other hyperparameter values (depending on the non-Gaussianity of the true conditional 
posterior) and especially when the hyperparameters are optimized. 

6.2 Predictive comparisons with estimation of hyperparameter 

In this section we compare the predictive performance of LA, EP, fVB, VB, and MCMC 
with estimation of the hyperparameters. The predictive performance was measured with 
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the mean absolute error (MAE) and the mean log predictive density (MLPD). With the 
Friedman data these are evaluated using a test set of 1000 latent variables for each of 
the 10 simulated data sets. For the Boston housing and concrete quality data 10-fold cross 
validation is used. For the compressive strength data 2-fold cross-validation was used because 
of the large number observations. To assess the significance of the differences between 
model performances, 95% credible int ervals for the MLPD measur es were approximated 
by Bayesian bootstrap as described by I Vehtari and Lampinen (l2002h . Gaussian likelihood 



(GA) is selected as a baseline model for comparisons. With GA, LA, EP, and VB the 
hyperparameters were estimated by optimizing the marginal posterior densities whereas 
with MCMC all parameters were sampled. The fVB approach was implemented following 
Kuss (j2006l ) where the hyperparameters are adapted in the M-step of the EM-algorithm. 



The variational lower bound associated with the M-step was augmented with the same 
hyperpriors that were used with the other methods. 

Since the MAP inference on the degrees of freedom parameter v proved challenging 
due to possible identifiability issues, the LA, EP, fVB, and VB approximations are tested 
with V both fixed to 4 (LAI, EPl, fVBl, VBl) and optimized together with the other 
hyperparameters (LA2, EP2, fVB2, VB2). = 4 was chosen as a robust default alternative 
to the normal distribution which allows for outliers but still has finite variance compared to 
the extremely wide-tailed alternatives with v <2. With EP we also tested a crude but very 
simple approach (from now on EPS) for improving the robustness of the estimation of v. 
We selected 15 values Vj from the interval [1.5,20] linearly in the log-log scale and ran the 
optimization of all the other hyperparameters with v fixed to these values. The conditional 
posterior of the latent values was approximated as 



p(/*|^,x) ^Y^Wjq{i\V,ej,a],Vj), 

3 

where {%,cr|} = aigrcia^Q^^i q{9,a'^\V,Vj) and Wj = q{9j,a'^,iJj\V)/J2k'ii^k,(^k^i'k\1^)- 
This can be viewed as a crude approximation of the integration over u where p{9,a'^\u,T)) 
is assumed to be very narrowly distributed around the mode. This approximation requires 
optimization of 6 and cr^ with all the preselected values of u and to speed up the computations 
9 and o"^ were initialized to the previous mode. 

The squared exponential covariance ([2]) was used for all models. Uniform priors were 
assumed for 6 and cj^ on log-scale and for u on log-log-scale. The input and target variables 
were scaled to zero mean and unit variances. The degrees of freedom i' was initialized to 
4, a to 0.5 and the magnitude cr^g to 1. The optimization was done with different random 
initializations for the length-scales Zi, and the result with the highest posterior marginal 
density q{9,i',a'^\V) was chosen. The MCMC inference on the latent values was done with 
both Gibbs sampling base d on the scale-mixture mode l ([3]) and direct application of the 



scaled HMC as described in lVanhatalo and Vehtaril (120071 ). Sampling of the hyperparameters 



was tested with both slice sampling and HMC. The scale-mixture Gibbs sampling (SM) 
combined with the slice sampling of the hyperparameters resulted in the best mixing of 
the chains and gave the best predictive performance and therefore only those results are 
reported. The convergence and quality of the MCMC runs was checked by both visual 
inspections as well as by calculating the potential scale reduction facto r s, effec t ive nu mber of 
independent samples, and autocorrelation times ( Gelman et al. . 2004 : Gever . 19921 ). Based 
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on the convergence diagnostics, burn-in periods were excluded from the beginning of the 
chains and the remaining draws were thinned to form the final MCMC estimates. 



Figures 6(a) (c) , (e) and (g) show the MLPD values and their 95% credible intervals 



for all methods in the four data sets. The Student-t model performs better with all ap- 
proximations on all data sets compared to the Gaussian model. The differences between 
the approximations are visible but not very clear. To illustrate the differences more clearly 
Figures |6(b)[ |(d)[ |(f)| and |(h)| show the pairwise comparisons of the log posterior predictive 
densities to SM. The mean values of the pairwise differences together with their 95% credible 
intervals are visualized. The Student-t model with the SM implementation is significantly 
better than the Gaussian model with higher than 95% probability in all data sets. SM also 
performs significantly better than all other approximations with the Friedman and com- 
pressive strength data and with the Housing data only EPl is not significantly worse. The 
differences are much smaller with the concrete quality data and there EPl performs actu- 
ally better than SM. One possible explanation for this is a wrong assumption on the noise 
model (evidence for covariate dependent noise was found in other experiments). Another 
possibility is the experimental design used in the data collection; there is one input variable 
that is able to classify large proportion of the observation with very small length scale, and 
sampling of this parameter may lead to worse performance. 

Other pairwise comparisons reveal that either EPl or EP2 is significantly better than 
LA, VB, and fVB in all data sets except the compressive strength data for which significant 
difference is not found when compared to LAI. If the better performing optimization strat- 
egy is selected from either LA, fVB, or VB, LA is better than fVB and VB with the Friedman 
data and the compressive strength data. Between fVB or VB no significant differences were 
found in pairwise comparisons. 

Optimization of v proved challenging and sensitive to the initialization of the hyper- 
parameters. The most difficult was fVB for which u often drifted slowly towa r ds lar ger 
values. This may be due to our implementation that was made following (Kuss, 20061 ) or 
more likely to the EM style optimization of the hyperparameters. With LA, EP, and VB the 
integration over f is redone in the inner-loop for all objective evaluations in the hyperparam- 
eter optimization, whereas with fVB the optimization is pursued with fixed approximation 
q{i\V,e,v,a'^). The EP-based marginal likelihood estimates were most robust with regards 
to the hyper parameter initialization. According to pairwise comparisons LA2 was signifi- 
cantly worse than LAI only in the compressive strength data. EP2 was significantly better 
than EPl in the housing and compressive strength data but significantly worse with the 
housing data. With fVB and VB optimization of v gave significantly better performance 
only with the simulated Friedman data, and significant decrease was observed with VB2 in 
the housing and compressive strength data. In pairwise comparisons, the crude numerical 
integration over v (EPS) was significantly better than EPl and EP2 with the housing and 
compressive strength data, but never significantly worse. These results give evidence that 
the EP approximation is more reliable in the hyperparameter inference because of the more 
accurate marginal likelihood est imates which is in line with the results in GP classification 
( Nickisch and Rasmussen . 20081 ). 

If MAE is considered the Student-t model was significantly better than GA in all other 
data sets except with the concrete quality data, in which case only EPl gave better results. 
If the best performing hyperparameter optimization scheme is selected for each method. 
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Figure 6: Left column: The mean log posterior predictive density (MLPD) and its 95% 
central credible interval. Gaussian observation model (GA) is shown for refer- 
ence. The Student-t model is inferred with LA, EP, fVB, VB, and scale-mixture 
based Gibbs sampling (SM). Number 1 after a method means that u is fixed, 2 
means that it is optimized, and 3 stands for the simple approximative numerical 
integration over u. Right column: Pairwise comparisons of the log posterior 
predictive densities with respect to SM. The mean together with its 95% central 
credible interval are shown. Values greater than zero indicate that a method is 
better than SM. 
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Table 1: Two upper rows: The relative CPU times required for the hyperparameter infer- 
ence. The times are scaled to yield 1 for LAI separately for each of the four data 
sets, and both the relative mean (mean) as well as the maximum (max) over the 
data sets are reported. The third row: The average relative CPU times over the 
four data sets with the hyperparameters fixed to 28 preselected configurations. 





GA 


LAI 


LA2 


EPl 


EP2 


EPS 


fVBl 


fVB2 


VBl 


VB2 


SM 


mean 


0.07 


1.0 


0.8 


0.8 


7.0 


13 


15 


8.9 


1.6 


1.8 


280 


max 


0.09 


1.0 


1.2 


1.1 


16 


26 


39 


22 


3.3 


3.8 


440 


fixed 


0.1 


1.0 


5.5 


2.4 


1.9 





EP is significantly better than the others with all data sets except with the compressive 
strength data in which case the differences were not significant. When compared to SM, EP 
was better with the Friedman and concrete quality data, otherwise no significant differences 
were found. LA was significantly better than fVB and VB in the compressive strength data 
whereas with the simulated Friedman data VB was better than LA and fVB. 

Table [1] summarizes the total CPU times required for the posterior inference (includes 
hyperparameter optimization and the predictions). The CPU times are scaled to give one 
for LAI and both the mean and maximum over the four data sets are reported. The 
running times for the fastest Student-t approximations are roughly 10-fold compared to 
the baseline method GA. EPl, where z/ = 4, is surprisingly fast compared to the LA but 
with the optimization of v (EP2) it gets much slower. This is explained by the increasing 
number of double-loop iterations required to achieve convergence with the larger number of 
difficult posterior distributions as v gets smaller values. The EP3 is clearly more demanding 
compared to EPl or EP2 because the optimization has to be repeated with every preselected 
value of V. The fVB is quite slow compared to LA or VB because of the slowly progressing 
EM-based hyperparameter adaptation. With the LA and VB the running times are quite 
similar with v both fixed and optimized. The running times are suggestive in the sense 
that they depend much on the implementations, convergence thresholds and initial guess of 
the hyperparameters. Table [1] shows also the average relative running times over the four 
data sets (excluding MCMC) with the hyperparameters fixed to 28 different configurations 
(fixed) . The configurations were created by first including the MCMC mean for each data set 
and then generating all combinations of three clearly different values of i/, cr, and Uge around 
the MCMC mean with randomly selected lengthscales. Quite many difficult hyperparameter 
configurations were created which shows in the larger running time with the EP. 



7. Discussion 



Much research has been done on the EP and it has been found very accurate and compu- 
tationally efficient in many practical applications. Non-log-concave site functions may be 
problematic for the EP but it has been utilized and foui id effective f or many potentially dif- 
ficult models such as t he Gaussian mixture likel i hoods ( Kussl . 20061 : Stegle et al. . 20081 ) and 



priors (spike and slab, Hernandez-Lobato et al. . 20081 ). Modifications such as the damping 



and the fractional updates as well as alternative double-loop algorithms have been proposed 
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to improve the stability in difficult cases but the practical implementation issues have not 
been discussed that much. In this work we have given an example of the good predictive 
performance of the EP in a challenging model but also analyzed the convergence problems 
and the proposed improvements from a practical point of view. The Student-t model is 
an interesting example because it gives a natural interpretation for the negative site preci- 
sions as the increase of the posterior uncertainty related to the not-so-clear outliers. On the 
other hand, conflicting outliers in a region with considerable uncertainty about the unknown 
function values may require very small but positive site precisions, which turned out to be 
problematic for the regular EP updates in our experiments. 

The nonlinear GP regression makes the inference problem even more challenging because 
the multimodality of the conditional posterior depends on the hyperparameter values. In 
practical applications, when the hyperparameters are optimized, an estimate of the marginal 
likelihood is important also with the more difficult cases. As we have demonstrated by exam- 
ples, standard EP, unless damped sufficiently, may not converge with the maximum marginal 
likelihood estimate of the hyperparameters, and therefore, one cannot simply discard these 
hyperparameter values. In our examples these situations were related to two modes in the 
conditional posterior (caused by two outliers) quite far away from each other which requires 
a very large local increase of marginal variance from the unimodal posterior approximation. 
(It should also be noted that moderately damped sequential EP worked fine with many other 
multimodal posterior distributions.) The globally unimodal assumption is not the best in 
such cases although the true underlying function is unimodal, but we think that it is im- 
portant to get some kind of posterior approximation. Whether one prefers the possible false 
certainty provided by the Laplace or VB approximations, or the possible false uncertainty 
of the EP, is a matter of taste but we prefer the latter one. 

It is also important that the inference procedure gives some clue of the underlying prob- 
lems so that more elaborate models can be designed. In addition to the examination of the 
posterior approximation, the need for double-loop iterations in our EP implementation may 
be one indication of an unsuitable model. One can also compare the cavity distributions 
which can be regarded as LOO estimates of the latent values. If for certain sites most of the 
LOO information comes from the corresponding site approximations, i.e., the cavity preci- 
sions are close to zero, there is reason to suspect that the approximation is not suitable. Our 
EP implementation enables a robust way of approaching this limit and in case of problems 
one can switch to fractional updates. 

In this work we have focused on the Student-t model and have omitted comparisons to 
other robust observations r nodel s such as the Laplace distribution and mixtures of Gaussians 
( Kussl . 2006 : Steele et al. . 20081 ). Laplace distribution and Gaussian mixtures are compu- 
tationally convenient since the moment evaluations required for the EP updates can be 
done analytically. However, the Student-t model is more general in the sense that it can 
be thought of as an infinite mixture of Gaussians and parameter v can be used to contin- 
uously adjust the degree of robustness. In addition, it contains the Gaussian distribution 
as a special case. The thickness of the tails could be adjusted also with a finite mixture 
of Gaussians but the number of parameters increases with the number of mixture compo- 
nents. Inferring a point estimate of v from the data turned out to be challenging but EP 
was the most consistent approximative method for it in our experiments. We showed that 
fixing = 4 gives a good baseline model for comparisons and we also described a simple 
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grid based approximation for improving the estimation of v based on data. The presented 
modifications for improving the robustness of the parallel EP are general and they could 
be applied also for other non-log-concave likelihoods. The presented EP approach for GP 
regression with a Student-t likelihood will be implemented in the freely available GPstuff 
software package ( |http : //www . Ice .hut . f i/research/mm/gpstuf f / ). 
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